We now have the necessary tools under our belt, ggplot2 and dplyr. As we move forward, these will be the primary tools we use to explore our data. Let’s take a quick look back at the goals of this phase.
Variation within categories
To refresh our memory on our Kickstarter data, we can run some functions to inspect the data again like we did in Phase 1. I’ll just do one here to start with, but feel free to reacquaint yourself with the dataset however you see fit.
ks %>% summary()
## id photo name
## Min. :3.703e+06 Length:500 Length:500
## 1st Qu.:5.918e+08 Class :character Class :character
## Median :1.162e+09 Mode :character Mode :character
## Mean :1.124e+09
## 3rd Qu.:1.649e+09
## Max. :2.147e+09
## blurb goal pledged
## Length:500 Min. : 10 Min. : 0.00
## Class :character 1st Qu.: 2000 1st Qu.: 54.75
## Mode :character Median : 5000 Median : 1076.00
## Mean : 12673 Mean : 7434.08
## 3rd Qu.: 12000 3rd Qu.: 5503.58
## Max. :125000 Max. :172586.00
## state slug country
## Length:500 Length:500 Length:500
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## currency currency_trailing_code deadline
## Length:500 Mode :logical Min. :2010-03-07 05:00:00
## Class :character FALSE:60 1st Qu.:2013-09-16 12:26:19
## Mode :character TRUE :440 Median :2015-03-28 19:49:20
## NA's :0 Mean :2014-12-07 00:39:29
## 3rd Qu.:2016-04-16 15:34:53
## Max. :2017-09-19 22:21:48
## state_changed_at created_at
## Min. :2010-03-07 05:00:10 Min. :2010-01-13 17:25:44
## 1st Qu.:2013-09-16 12:26:56 1st Qu.:2013-06-26 11:01:40
## Median :2015-03-28 19:49:21 Median :2015-01-05 03:29:48
## Mean :2014-12-04 11:11:37 Mean :2014-09-21 12:36:46
## 3rd Qu.:2016-04-15 06:05:55 3rd Qu.:2016-02-07 05:27:22
## Max. :2017-08-15 18:01:03 Max. :2017-08-10 17:16:07
## launched_at staff_pick is_starrable
## Min. :2010-01-13 22:35:38 Mode :logical Mode :logical
## 1st Qu.:2013-08-11 02:22:38 FALSE:440 FALSE:488
## Median :2015-02-23 10:40:26 TRUE :60 TRUE :12
## Mean :2014-11-03 00:13:57 NA's :0 NA's :0
## 3rd Qu.:2016-03-15 20:33:17
## Max. :2017-08-15 18:01:02
## backers_count static_usd_rate usd_pledged creator
## Min. : 0.00 Min. :0.05474 Min. : 0 Length:500
## 1st Qu.: 2.00 1st Qu.:1.00000 1st Qu.: 60 Class :character
## Median : 21.00 Median :1.00000 Median : 1100 Mode :character
## Mean : 87.84 Mean :1.00932 Mean : 6632
## 3rd Qu.: 72.25 3rd Qu.:1.00000 3rd Qu.: 5145
## Max. :3399.00 Max. :1.69769 Max. :152604
## location category profile spotlight
## Length:500 Length:500 Length:500 Mode :logical
## Class :character Class :character Class :character FALSE:264
## Mode :character Mode :character Mode :character TRUE :236
## NA's :0
##
##
## urls source_url
## Length:500 Length:500
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Notice the use of the pipe?
When we talk about variation within categories, we mean breaking the data up into groups and seeing how the values of other variables differ from group to group. Looking through the summary of each variable above, we can pick out a few that might be useful for breaking our data into categories. “state” looks like it could hold information about the success of each project, which would make for a great set of categories. The following variables also look useful: country, currency, location, category, spotlight, and is_starrable.
What these variables have in common is that they are (or can be) factor variables. Factor variables hold categorical data: the range of possible values is fixed, and those values are known as “levels”. Levels can be ordered or unordered. Let’s take a look at the levels of a few of the factor variables mentioned above.
levels(ks$state)
## NULL
The reason we got NULL here is that R has the state variable encoded as a character variable. You can scroll up to where we ran the summary function to see that (where it says Class :character). We simply need to tell R to interpret it as a factor. We’ll use the factor() function, which is part of base R; the forcats package loaded below adds further helpers for working with factors.
library(forcats)
levels(factor(ks$state))
## [1] "canceled" "failed" "live" "successful" "suspended"
The code above only shows us what those levels would be if “state” were a factor variable. To make the change permanent, we need to reassign the state variable:
ks$state <- factor(ks$state)
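As a quick aside, here is a minimal sketch of the character-versus-factor distinction and of the “ordered or not” point from earlier. The vectors below are made up for illustration and are not part of the Kickstarter data.

```r
# Toy values for illustration only; not taken from `ks`.
status_chr <- c("failed", "successful", "failed", "live")

# A character vector has no levels attribute.
levels(status_chr)  # NULL

# Converting to a factor creates levels (alphabetical by default).
status_fct <- factor(status_chr)
levels(status_fct)

# An ordered factor needs its level order spelled out explicitly.
size_ord <- factor(
  c("small", "large", "medium"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)

# Comparisons such as < only make sense for ordered factors.
size_ord < "large"
```

None of the variables we are converting here have a natural order, so plain (unordered) factors are enough for our purposes.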
We can do the same for several other variables that should be treated as factors. Let’s see what we find out from a few of the other variables. Checking the levels should tell us whether or not they may be good candidates to be treated as factors.
levels(factor(ks$country))
## [1] "AU" "BE" "CA" "DE" "DK" "ES" "GB" "HK" "IT" "MX" "NO" "NZ" "SE" "US"
levels(factor(ks$category))
## [1] "3D Printing" "Academic" "Accessories"
## [4] "Action" "Animation" "Anthologies"
## [7] "Apparel" "Apps" "Architecture"
## [10] "Art Books" "Audio" "Blues"
## [13] "Camera Equipment" "Candles" "Ceramics"
## [16] "Children's Books" "Childrenswear" "Classical Music"
## [19] "Comedy" "Comic Books" "Conceptual Art"
## [22] "Cookbooks" "Country & Folk" "Couture"
## [25] "Digital Art" "DIY" "DIY Electronics"
## [28] "Documentary" "Drama" "Drinks"
## [31] "Electronic Music" "Events" "Experimental"
## [34] "Fabrication Tools" "Faith" "Family"
## [37] "Fantasy" "Farmer's Markets" "Farms"
## [40] "Festivals" "Fiction" "Fine Art"
## [43] "Food Trucks" "Footwear" "Gadgets"
## [46] "Gaming Hardware" "Glass" "Graphic Design"
## [49] "Graphic Novels" "Hardware" "Hip-Hop"
## [52] "Horror" "Illustration" "Indie Rock"
## [55] "Jazz" "Jewelry" "Literary Journals"
## [58] "Live Games" "Makerspaces" "Metal"
## [61] "Mixed Media" "Mobile Games" "Music Videos"
## [64] "Musical" "Narrative Film" "Nature"
## [67] "Nonfiction" "Painting" "People"
## [70] "Performance Art" "Performances" "Periodicals"
## [73] "Photo" "Photobooks" "Places"
## [76] "Playing Cards" "Plays" "Poetry"
## [79] "Pop" "Print" "Printing"
## [82] "Product Design" "Public Art" "Puzzles"
## [85] "Quilts" "R&B" "Radio & Podcasts"
## [88] "Ready-to-wear" "Restaurants" "Robots"
## [91] "Rock" "Romance" "Science Fiction"
## [94] "Sculpture" "Shorts" "Small Batch"
## [97] "Software" "Spaces" "Tabletop Games"
## [100] "Television" "Textiles" "Thrillers"
## [103] "Translations" "Video" "Video Art"
## [106] "Video Games" "Wearables" "Web"
## [109] "Webcomics" "Webseries" "Woodworking"
## [112] "World Music" "Young Adult" "Zines"
levels(factor(ks$is_starrable))
## [1] "FALSE" "TRUE"
levels(factor(ks$spotlight))
## [1] "FALSE" "TRUE"
levels(factor(ks$currency))
## [1] "AUD" "CAD" "DKK" "EUR" "GBP" "HKD" "MXN" "NOK" "NZD" "SEK" "USD"
levels(factor(ks$location))
## [1] "Abingdon, UK" "Acton, MA"
## [3] "Ada, OK" "Albuquerque, NM"
## [5] "Alexandria, VA" "Amarillo, TX"
## [7] "Amsterdam, Netherlands" "Anaheim, CA"
## [9] "Antioch, CA" "Apoteri, Guyana"
## [11] "Arhus, Denmark" "Arlington, VA"
## [13] "Asbury Park, NJ" "Atascadero, CA"
## [15] "Athens, OH" "Atlanta, GA"
## [17] "Auckland, NZ" "Aurora, IL"
## [19] "Austin, TX" "Bakersfield, CA"
## [21] "Baldwin, GA" "Baltimore, MD"
## [23] "Banner Elk, NC" "Barcelona, Spain"
## [25] "Bayside, Queens, NY" "Beaufort, NC"
## [27] "Bellingham, WA" "Bergen, Norway"
## [29] "Berlin, Germany" "Birmingham, UK"
## [31] "Blacksburg, VA" "Bloenduos, Iceland"
## [33] "Bloomington, IL" "Bloomington, IN"
## [35] "Bonn, Germany" "Boston, MA"
## [37] "Bowling Green, KY" "Brisbane, AU"
## [39] "Bristol, UK" "Bronx, NY"
## [41] "Brooklyn, NY" "Brownsville, TX"
## [43] "Buenos Aires, Argentina" "Burbank, CA"
## [45] "Cadillac, MI" "Cambridge, MA"
## [47] "Canberra, AU" "Canterbury, UK"
## [49] "Catonsville, MD" "Cazenovia, NY"
## [51] "Charleston, SC" "Charlotte, NC"
## [53] "Chicago, IL" "Cincinnati, OH"
## [55] "Clayton, OH" "Cleveland, OH"
## [57] "Cologne, Germany" "Colorado Springs, CO"
## [59] "Columbus, GA" "Columbus, OH"
## [61] "Copenhagen, Denmark" "Costa Mesa, CA"
## [63] "Dallas, TX" "Davis, CA"
## [65] "De Kalb, IL" "Decatur, GA"
## [67] "Denver, CO" "Doncaster, UK"
## [69] "Dover, NH" "Downtown Toronto, Canada"
## [71] "Dumfries, VA" "Durango, CO"
## [73] "Durham, NC" "East Lansing, MI"
## [75] "Elk Grove, CA" "Elmwood Park, IL"
## [77] "Erfurt, Germany" "Espa\xed\xb1a, Spain"
## [79] "Evanston, IL" "Federal Way, WA"
## [81] "Fenton, MI" "Flagstaff, AZ"
## [83] "Fort Lauderdale, FL" "Fort Worth, TX"
## [85] "Foyil, OK" "Frankfurt, Germany"
## [87] "Fredericksburg, VA" "Freising, Germany"
## [89] "Funkstown, MD" "Georgetown, TX"
## [91] "Gerlach, NV" "Gig Harbor, WA"
## [93] "Gladstone, OR" "Glencoe, KY"
## [95] "Glenolden, PA" "Gothenburg, Sweden"
## [97] "Grafton, VT" "Granada, Spain"
## [99] "Greater London, UK" "Greater Manchester, UK"
## [101] "Guildford, UK" "Gulfport, MS"
## [103] "Halifax, Canada" "Hamburg, Germany"
## [105] "Harrisburg, PA" "Haugesund, Norway"
## [107] "Heber City, UT" "Helena, MT"
## [109] "Hereford, UK" "Hermosa Beach, CA"
## [111] "High Laver, UK" "Hilton Head Island, SC"
## [113] "Hoi An, Viet Nam" "Hong Kong, Hong Kong"
## [115] "Houston, TX" "Hove, UK"
## [117] "Hsinchu City, Taiwan" "Huntingdon, PA"
## [119] "Huntington Beach, CA" "Idaho Falls, ID"
## [121] "Isleworth, UK" "Jacksonville, FL"
## [123] "Kalamazoo, MI" "Kansas City, MO"
## [125] "Kelowna, Canada" "Kennewick, WA"
## [127] "Ketchum, ID" "Key Biscayne, FL"
## [129] "Kiev, Ukraine" "King of Prussia, PA"
## [131] "Knoxville, TN" "Kortrijk, Belgium"
## [133] "Kosice, Slovakia" "La Salle, IL"
## [135] "Lakeville, MN" "Lakewood, CO"
## [137] "Lancashire, UK" "Lansing, MI"
## [139] "Las Vegas, NV" "Lawton, OK"
## [141] "Lehighton, PA" "Leicester, UK"
## [143] "Leicestershire, UK" "Lexington, KY"
## [145] "Little Rock, AR" "Lombard, IL"
## [147] "London, Canada" "London, UK"
## [149] "Los Angeles, CA" "Louisville, KY"
## [151] "Lynchburg, VA" "Madrid, Spain"
## [153] "Malaga, Spain" "Malm̦, Sweden"
## [155] "Manassas Park, VA" "Manchester, UK"
## [157] "Mandan, ND" "Manhattan, NY"
## [159] "Manila, Philippines" "Martinsburg, WV"
## [161] "Melbourne, AU" "Memphis, TN"
## [163] "Mesa, AZ" "Mestre, Italy"
## [165] "Mexico, Mexico" "Miami, FL"
## [167] "Middleboro, MA" "Minneapolis, MN"
## [169] "Monroe, NC" "Monterey, CA"
## [171] "Monterrey, Mexico" "Montreal, Canada"
## [173] "Mundelein, IL" "Nanaimo, Canada"
## [175] "Naples, FL" "Nashville, TN"
## [177] "New Orleans, LA" "New York, NY"
## [179] "Newport, UK" "Norcross, GA"
## [181] "North Hollywood, Los Angeles, CA" "North Ipswich, AU"
## [183] "North Yorkshire, UK" "Nyack, NY"
## [185] "Oakland, CA" "Oklahoma City, OK"
## [187] "Old Town Stony Plain, Canada" "Omaha, NE"
## [189] "Ontario, CA" "Orimattila, Finland"
## [191] "Orlando, FL" "Oshkosh, WI"
## [193] "Palma de Mallorca, Spain" "Palo Alto, CA"
## [195] "Pasadena, CA" "Peekskill, NY"
## [197] "Pensacola, FL" "Philadelphia, PA"
## [199] "Phoenix, AZ" "Pittsburgh, PA"
## [201] "Placerville, CA" "Plantation, FL"
## [203] "Plymouth, MI" "Portland, ME"
## [205] "Portland, OR" "Portsmouth, UK"
## [207] "Provo, UT" "Queens, NY"
## [209] "Redmond, WA" "Richmond, KY"
## [211] "Richmond, VA" "Riverside, CA"
## [213] "Roscoe, IL" "Royal Oak, MI"
## [215] "Sacramento, CA" "Saddle River, NJ"
## [217] "Sag Harbor, NY" "Salem, NH"
## [219] "Salt Lake City, UT" "San Antonio, TX"
## [221] "San Diego, CA" "San Francisco, CA"
## [223] "San Marcos, TX" "Sandwich, MA"
## [225] "Sandy, UT" "Santa Ana, CA"
## [227] "Santa Clara, CA" "Santa Cruz, CA"
## [229] "Santa Fe, NM" "Santa Monica, CA"
## [231] "Scarborough, AU" "Schaumburg, IL"
## [233] "Scranton, PA" "Scunthorpe, UK"
## [235] "Seattle, WA" "Selma, CA"
## [237] "Seoul, South Korea" "Shanghai, China"
## [239] "Shawnee, KS" "Snowflake, AZ"
## [241] "Somerset, KY" "South Houston, TX"
## [243] "Spirit Lake, IA" "Spokane, WA"
## [245] "Spring Hill, TN" "Springfield, MO"
## [247] "St. Augustine, FL" "St. Louis, MO"
## [249] "St. Paul, MN" "St.-Bruno-de-Montarville, Canada"
## [251] "Staten Island, NY" "Stockholm, Sweden"
## [253] "Sturgis, SD" "Summerside, Canada"
## [255] "Sussex, NJ" "Syracuse, NY"
## [257] "Tampa, FL" "Timmins, Canada"
## [259] "Titusville, FL" "Toronto, Canada"
## [261] "Trento, Italy" "Trondheim, Norway"
## [263] "Tucson, AZ" "Twin Falls, ID"
## [265] "Tyler, TX" "Ukiah, CA"
## [267] "Upland, IN" "Vestnes, Norway"
## [269] "Vigo, Spain" "Vilnius, Lithuania"
## [271] "Waco, TX" "Warner Robins, GA"
## [273] "Washington, DC" "Wayland, NY"
## [275] "Wesley Chapel, FL" "West Monroe, LA"
## [277] "Wheaton, IL" "Whistler, Canada"
## [279] "White River Junction, VT" "Wichita, KS"
## [281] "Wiesbaden, Germany" "Willimantic, CT"
## [283] "Wilmington, DE" "Wilmington, NC"
## [285] "Windsor Locks, CT" "Winona, MN"
## [287] "Woodbury, MN" "Zacatecas, Mexico"
levels(factor(ks$creator))
## [1] "Aaron and Jan Geibel"
## [2] "Abigail Scollay"
## [3] "abode"
## [4] "Acad Version"
## [5] "Adam Geiger"
## [6] "Adam Leech"
## [7] "Adam Marie"
## [8] "Adam Metropolis"
## [9] "Adisa Zvekic"
## [10] "Adrian Allen"
## [11] "Aevi Watches"
## [12] "Ahmad Merheb"
## [13] "Airpaq"
## [14] "Airship Isabella"
## [15] "AJ Sikes"
## [16] "Alan Wood"
## [17] "Alan Yeung"
## [18] "Alexandra"
## [19] "Alexandra Blue"
## [20] "Alexandra Ritchie"
## [21] "AlexHubbell"
## [22] "Ali"
## [23] "Alika Davis"
## [24] "Alisa McCance"
## [25] "Allen Roe"
## [26] "Allison"
## [27] "AmbeRed"
## [28] "amirhossein momen"
## [29] "Amita Nathwani"
## [30] "Andre Johnson"
## [31] "Andrew"
## [32] "Andrew Blossom"
## [33] "Andrew DeChristopher"
## [34] "Andy Levy"
## [35] "Angry Inch Brewing"
## [36] "Ann Marie Coviello"
## [37] "Anthony Djuren"
## [38] "Anthony Piper"
## [39] "Antonio Casasanta"
## [40] "Apartment 5E Theater Company"
## [41] "Ara Gureghian"
## [42] "Ari Rice (deleted)"
## [43] "Ashley Allen"
## [44] "Ashley Carr"
## [45] "Ballistic Studios"
## [46] "Ben B"
## [47] "Benjamin I Bryan"
## [48] "Bernd Ott & Emily Besa"
## [49] "Bill Elgin (deleted)"
## [50] "Billy W. Mitchell"
## [51] "Biscotte Yarns"
## [52] "Blake Louis Hocker"
## [53] "Bob Humphrey"
## [54] "Bobby Choy"
## [55] "Brad Christmann"
## [56] "Brandy Lawhorn"
## [57] "Brian Garber"
## [58] "Brian Hawkins"
## [59] "Brian K. Palmer"
## [60] "Brock DeBoer"
## [61] "Brooke Smith (deleted)"
## [62] "Bucket Siler"
## [63] "Caleb Gave Mathis"
## [64] "Caleb Stephens"
## [65] "Canadian Institute for Czech Music"
## [66] "Carl Rossi"
## [67] "Carlin Adelson"
## [68] "Carly Plasha"
## [69] "Casey Hayes"
## [70] "Cassandra Turner"
## [71] "Cassie McDaniel"
## [72] "Catherine Weiss-Celley"
## [73] "Charles Johnson Jr."
## [74] "Chelsea Hrynick Browne"
## [75] "Chris Andersen"
## [76] "Chris Calzia"
## [77] "Chris Coyne"
## [78] "Chris Matthewman"
## [79] "Christian Bartram"
## [80] "Christian Rosier"
## [81] "Christopher Campbell"
## [82] "Christopher Ciesiel"
## [83] "Christopher Head (deleted)"
## [84] "Christopher Herrera"
## [85] "christopher nicholas"
## [86] "Cineridge Entertainment, LLC."
## [87] "CJP"
## [88] "Clarence Oates"
## [89] "Classy Cake Creations"
## [90] "Claudia Stocker"
## [91] "Codie Cosgrove"
## [92] "Colin Blakely"
## [93] "Colin Momeyer"
## [94] "Communist Daughter"
## [95] "Corey Landen"
## [96] "corey underwood"
## [97] "Cornelius Sullivan"
## [98] "Cyrus Farivar"
## [99] "Dan Phelps CD release"
## [100] "Daniel Eggington (deleted)"
## [101] "Daniel Jensen"
## [102] "Daniel Sanchez"
## [103] "Daniel Tidwell"
## [104] "Dante"
## [105] "Danza-RevistaMX"
## [106] "Darts Connect"
## [107] "DAVE WEISBERG"
## [108] "David Bui"
## [109] "David Cornelson"
## [110] "David Guinn"
## [111] "David J. Morris"
## [112] "David Toledo"
## [113] "David Wanczyk"
## [114] "David White"
## [115] "David Zawacki"
## [116] "Dawn Deason (deleted)"
## [117] "Deborah Walther"
## [118] "Denis and Terri Zafiros"
## [119] "Desiree Turner"
## [120] "Devyn DeLoera"
## [121] "Dia Proimos"
## [122] "Doctor Octoroc"
## [123] "Dorothy Gambrell"
## [124] "Down In Light"
## [125] "Drawing From Heaven"
## [126] "Dreaming City Books (Jim Kirkland Pub.)"
## [127] "Dueling Wizards, LLC"
## [128] "Dustin"
## [129] "Dustin White"
## [130] "Dylan Guffey"
## [131] "easyshower"
## [132] "Ed Galloway Totem Pole Park"
## [133] "Ed Goldberg"
## [134] "Edwin Premberg (deleted)"
## [135] "Eileen"
## [136] "Elizabeth Raybee"
## [137] "Elly Blue"
## [138] "EQUIPT for PLAY"
## [139] "Eric Anderson"
## [140] "Eric Holstein"
## [141] "Eric Jeong"
## [142] "Erik Carl"
## [143] "Erik Kim Malmberg"
## [144] "Evading Azrael"
## [145] "Evanston Escola de Samba"
## [146] "Evelyn Aira"
## [147] "Evelyne Dubois"
## [148] "Evil Girlfriend Media"
## [149] "EXPLOSHIELD Limited"
## [150] "Filippo Sterrantino"
## [151] "Fledge"
## [152] "Flloyd"
## [153] "FOG DOG"
## [154] "Folding Firebox"
## [155] "France Garrido"
## [156] "Freak Show"
## [157] "FrigidFox"
## [158] "Full of Win Games"
## [159] "Gabriel Lubell"
## [160] "Galen Ihlenfeldt"
## [161] "Gary Dressler"
## [162] "Gast\xcc_n Arballo"
## [163] "Gee"
## [164] "Genealogical Society of Pennsylvania"
## [165] "Gizbee LLC"
## [166] "Gozer Games"
## [167] "Greg Adkins"
## [168] "Greg Stolze"
## [169] "Hamish John Appleby"
## [170] "Hannah Harvigsson"
## [171] "Happy Hour Hero3+ Productions (deleted)"
## [172] "Harrison Mead"
## [173] "Harry Herzberg"
## [174] "Heather Craig"
## [175] "Herbie J Pilato"
## [176] "Holly Hunt"
## [177] "Hotep TheArtist"
## [178] "Ian Pudney"
## [179] "Ian Reagan"
## [180] "In-Label Records"
## [181] "IndieCarry"
## [182] "Intermezzo"
## [183] "Ireca Sims"
## [184] "Iryna Kucheryava, James Warwick"
## [185] "Isabel Draves"
## [186] "isaiah lucero"
## [187] "Jack C. Newell"
## [188] "Jacob"
## [189] "Jacob Friedman"
## [190] "Jacob Porter"
## [191] "Jaime Armas"
## [192] "Jaime Wright"
## [193] "Jake Green"
## [194] "James A. Owen"
## [195] "James and Anna (deleted)"
## [196] "James Black (deleted)"
## [197] "James K. Holder II"
## [198] "James Kelley"
## [199] "James Smith"
## [200] "James Tradgett"
## [201] "jami lyn"
## [202] "Jamie"
## [203] "Jamie Bianchini"
## [204] "Jamie Martin"
## [205] "Jamie Plante"
## [206] "Jason Boone"
## [207] "Jason Peach"
## [208] "Jay B"
## [209] "jayblack"
## [210] "Jen Reeves"
## [211] "Jennica Schwartzman"
## [212] "Jennifer Silvey"
## [213] "Jenny Jarnagin"
## [214] "jerome"
## [215] "Jesse Banda"
## [216] "Jesse Manfra"
## [217] "Jesse Robison"
## [218] "Jim Ettwein"
## [219] "Joe Trojnor-Barron"
## [220] "John Alexander Miller"
## [221] "John Berendzen"
## [222] "John C. Henneberg"
## [223] "John Cullen"
## [224] "John Elefante"
## [225] "John Rap"
## [226] "John Santagada"
## [227] "Jon Antcliff"
## [228] "Jonathan"
## [229] "Jonathon High"
## [230] "Jonne Ziengs"
## [231] "Jordan Clark"
## [232] "Josh Bramos"
## [233] "Josh Gray"
## [234] "Joshua Adams"
## [235] "Joshua Emdon"
## [236] "Joshua R. Pinkas"
## [237] "Jozef Karpiel (deleted)"
## [238] "Juli Chavez"
## [239] "Julia M. Doughty / Doug Wood"
## [240] "Julie Renee McCarty"
## [241] "Justice Pirkey"
## [242] "Justin Terveen"
## [243] "J\xcc_rgen Scholz"
## [244] "Kairu Photography"
## [245] "Kait Rhoads"
## [246] "Kara McMaster"
## [247] "Karen Hansen"
## [248] "Karin Pihl"
## [249] "Karina Rocha"
## [250] "Karl Raschke"
## [251] "Kate Bell"
## [252] "Kate Wengier (and kids)"
## [253] "katemilford"
## [254] "Kathryn M Highfield"
## [255] "Kathy Fox - Fox Foods"
## [256] "Keith Newton & Steve Gorman"
## [257] "Kelly Matthews"
## [258] "Kelly Schatz"
## [259] "Ken Avery"
## [260] "Ken Bishop"
## [261] "Kenneth Green"
## [262] "Kenneth Helm"
## [263] "kermit eby lll"
## [264] "Kevin Fishburne"
## [265] "KEVIN HANLEY"
## [266] "Kevin Krysiak"
## [267] "Kevin Maloney"
## [268] "Kevin Shoemaker & Skylar Bennett"
## [269] "Kevis Antonio"
## [270] "Kharis Featuring Kendre Streeter"
## [271] "King Non"
## [272] "Kip Jalal BRITTON"
## [273] "Kirsten Berg"
## [274] "KitchEco by J\xcc\xfcrgen & Jacob"
## [275] "Kurt Vincent"
## [276] "Landfill Dzine"
## [277] "Landon Purser (deleted)"
## [278] "Laura Larson"
## [279] "Laura Preble"
## [280] "Leah @ RogueJewels"
## [281] "Lee Guerringue"
## [282] "Leonard Patton"
## [283] "Lesley Jones"
## [284] "Lew Lefton"
## [285] "Lichie"
## [286] "Lisa Maxwell"
## [287] "Logan Crannell"
## [288] "Lori fraize"
## [289] "Louis Williams"
## [290] "Luis Martmen"
## [291] "Lynn Hershman Leeson"
## [292] "Lynne M. Thomas"
## [293] "Mad Traffic"
## [294] "Major Skinner"
## [295] "MAKI - Games"
## [296] "Mamahuhu"
## [297] "Marcus Bittle"
## [298] "Marina"
## [299] "Mario Sosa"
## [300] "Marissa Quinn"
## [301] "Mark Miko and Istvan Vecsernyes"
## [302] "Mark Shirley"
## [303] "Mark Titus"
## [304] "MarQ P"
## [305] "Marshall Moose Moore"
## [306] "Martin Garan\x80\x8dovsk\xcc_"
## [307] "Marty Allen"
## [308] "Mary Gregory"
## [309] "Mary Kulikowski"
## [310] "Mary Trunk"
## [311] "Mat Coleman"
## [312] "Matt"
## [313] "matt leidecker"
## [314] "Matt Santoli"
## [315] "Max Frost"
## [316] "Melica Bloom"
## [317] "Mercedes Parker (deleted)"
## [318] "Mew Mew & Fluffy LTD"
## [319] "Micaton Ergonomics, S.L."
## [320] "Michael Hanna and Jeff Johnson"
## [321] "Michael James Farmer"
## [322] "Michael Muldoon"
## [323] "Michael Newberry"
## [324] "Michael Okincha"
## [325] "Michael Papathanasakis"
## [326] "Michael Patrick Flanigan Jr."
## [327] "Michael Reilly"
## [328] "Michelle \\\"Gearhead\\\" Haunold"
## [329] "Michelle Tran"
## [330] "Mick"
## [331] "Migo"
## [332] "Mike and Julie"
## [333] "Mina Yoo"
## [334] "Minnesota Dance Collaborative"
## [335] "Mr H"
## [336] "Mustapha"
## [337] "N. L. Kerr"
## [338] "NaDA Publishing"
## [339] "Nadia Karim"
## [340] "Naseem Nossiff"
## [341] "Nate"
## [342] "National Icon"
## [343] "Natures Talk Show LLC Voice Of Nature"
## [344] "Navasota String Band"
## [345] "Neil Meister"
## [346] "NetToons, Inc."
## [347] "Nic Carter"
## [348] "Nick Chiodras"
## [349] "NICOLAS LINARES"
## [350] "Nolan Brundige / NMB Creations"
## [351] "Normal Games Co"
## [352] "NorthsideComedy.com"
## [353] "Oleg Dergachov"
## [354] "Olga Almansky"
## [355] "OPUS High Technology Corp"
## [356] "Orestes Manousos"
## [357] "Oxford American"
## [358] "Parlor Hawk"
## [359] "ParteePartee (deleted)"
## [360] "Patricia Anaya"
## [361] "Patricia Noworol Dance Theater"
## [362] "Patrick C. Simpson-Jones"
## [363] "Patrick Healy"
## [364] "PeaceTones"
## [365] "Pelorus Press"
## [366] "Pensacola Little Theatre"
## [367] "Pete Kolo"
## [368] "Peter Allen"
## [369] "Peter Bond"
## [370] "Peter Sand"
## [371] "Petter Bendiksen"
## [372] "Petunia Tech"
## [373] "Philip Rice"
## [374] "Pitu Sanchez"
## [375] "PLAY AGAIN"
## [376] "PRETTYTHESERIES"
## [377] "Priscilla Aroean"
## [378] "QingYing E&T LLC"
## [379] "Rachelle Robinson"
## [380] "Raistlin"
## [381] "Rame Pizzeria"
## [382] "Randy Rodriguez"
## [383] "Red Scotch Software"
## [384] "Restoration Bid Inc."
## [385] "Rhonda Slone"
## [386] "Richard Tucci"
## [387] "Rishi Sethi"
## [388] "RJ4L"
## [389] "RNDM Design"
## [390] "Robert D. Jansen"
## [391] "Robert James"
## [392] "Robert P. Singleton"
## [393] "Roberto de Farias"
## [394] "Robin Bond"
## [395] "Rocco Panetta"
## [396] "Rodrigo M. Malmsten"
## [397] "Ron Edwards (deleted)"
## [398] "Roschman Dance and Wallis Knot"
## [399] "Row 1 Productions"
## [400] "Ruben Tello"
## [401] "Rudi"
## [402] "Rush Hicks"
## [403] "Ryan Ovadia"
## [404] "Ryan Pietrzak"
## [405] "Sabrina Cotugno"
## [406] "Sam Hayes"
## [407] "Sandra Golden"
## [408] "Scott Drotar"
## [409] "Scott Lost"
## [410] "Scott Thomson"
## [411] "Sean & Emma"
## [412] "Sean Taylor"
## [413] "Selective Perspective Collective"
## [414] "Sentinel Games"
## [415] "Shaina Tantuico"
## [416] "Shamek V Farrah"
## [417] "Shannon Byrne"
## [418] "Shasta Palmer"
## [419] "Shawn French"
## [420] "Sian Wheatcroft"
## [421] "Simon Arvidsson"
## [422] "Simon Harrison"
## [423] "Simon Horrocks"
## [424] "Simon Von Bargen"
## [425] "Simone's Market Stall"
## [426] "Sithari D"
## [427] "Smieszek"
## [428] "Spencer Miskoviak"
## [429] "Stacy Arnold-Strider"
## [430] "Stephanie"
## [431] "Stephanie Law"
## [432] "Stephen Greenberg"
## [433] "Steven (SVen)"
## [434] "Steven Battey (deleted)"
## [435] "Strange Biology"
## [436] "Styrman & Crew"
## [437] "Sukey Molloy"
## [438] "Supreme Clans"
## [439] "Suzanne Brockmann & small or LARGE"
## [440] "Suzy Liebermann"
## [441] "Sven Moss"
## [442] "Swayzee"
## [443] "S\xcc\xfcren F. Fantini"
## [444] "Tam Quoc Tran"
## [445] "Tanya"
## [446] "Taro"
## [447] "Taylor"
## [448] "Tea Silvestre Godfrey"
## [449] "Team Kaiju"
## [450] "Team Playout"
## [451] "The Art of Cool Project"
## [452] "The Beekeepers"
## [453] "The Hanser-McClellan Guitar Duo"
## [454] "The Queen Of England Stole my Parents"
## [455] "The SportPod"
## [456] "Theda Fresques"
## [457] "TheJobJob.com (deleted)"
## [458] "Theo Grimshaw"
## [459] "Theodore Sipes"
## [460] "Thom Turner"
## [461] "thomas mcglone"
## [462] "Thomas Walbert"
## [463] "Thor Platter"
## [464] "Thoren Rogers"
## [465] "Tim Rodriguez"
## [466] "Timothy Blakely"
## [467] "Tony MacGregor"
## [468] "Traces"
## [469] "Travis Greene"
## [470] "Tristan Wiener"
## [471] "Trusty Sidekick Theater Company"
## [472] "Tyler McNamer"
## [473] "Uber and Lyft Driver"
## [474] "Undefined Worship"
## [475] "UNlogical"
## [476] "USA Great Buys, LLC"
## [477] "Vernon Thompson"
## [478] "Veronica Rochelle Raggs"
## [479] "victor franco"
## [480] "Victoria Ann Van Arnam"
## [481] "Victoria Cosplay"
## [482] "Video Daughters"
## [483] "Viktoria Korman"
## [484] "Vincent Amaya"
## [485] "Vision Global"
## [486] "Vyonna Maldonado (deleted)"
## [487] "Wendy Martinez"
## [488] "Werner John"
## [489] "Wes Modes"
## [490] "Whiskey Mother Sucker Productions"
## [491] "Xavier Vargas"
## [492] "Yankee & The Foreigners"
## [493] "Yossra El Said"
## [494] "Zachary Brian Roth"
## [495] "Zhuhai CTC Electronic Co., LTD"
## [496] "Zoe Nicholson"
## [497] "Zona Jennifer"
That printed every level for each of the variables we checked. For a variable to be truly categorical, it should have substantially fewer distinct values than there are observations. Looking back through the output at the number of levels for each factor, all but “location” and “creator” have far fewer levels than total observations, which for our dataset is 500.
n_distinct(levels(as.factor(ks$location)))
## [1] 288
n_distinct(levels(as.factor(ks$creator)))
## [1] 497
Those numbers tell me that very few of our projects share a creator (497 distinct creators across 500 projects), while many more share a location (288 distinct locations). We could choose to leave these two as character data (especially creator), but there is some benefit to encoding them as factors despite not being true categories. We can always undo it later if we need to.
ks$country <- factor(ks$country)
ks$category <- factor(ks$category)
ks$is_starrable <- factor(ks$is_starrable)
ks$spotlight <- factor(ks$spotlight)
ks$currency <- factor(ks$currency)
ks$location <- factor(ks$location)
ks$creator <- factor(ks$creator)
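If you prefer to stay in dplyr, the same kind of conversion can be sketched in one step with mutate() and across(). This is only an alternative under a few assumptions: it needs dplyr 1.0.0 or later, and the `toy` data frame below is a made-up stand-in for `ks`. It also shows how to count distinct values per column and how a conversion can be undone.

```r
library(dplyr)

# Hypothetical toy data standing in for `ks`; the values are invented.
toy <- data.frame(
  country  = c("US", "US", "GB", "CA"),
  currency = c("USD", "USD", "GBP", "CAD"),
  creator  = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Number of distinct values in each column, to judge which are true categories.
distinct_counts <- sapply(toy, n_distinct)
distinct_counts

# Convert several character columns to factors in a single step.
toy_fct <- toy %>%
  mutate(across(c(country, currency, creator), factor))

# Undo the conversion for one column if it turns out not to be useful.
toy_fct <- toy_fct %>%
  mutate(creator = as.character(creator))
```

Either style gets the job done; the one-column-at-a-time assignments above are simply more explicit.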
We can verify each of our new factor variables:
is.factor(ks$state)
## [1] TRUE
is.factor(ks$category)
## [1] TRUE
is.factor(ks$is_starrable)
## [1] TRUE
is.factor(ks$spotlight)
## [1] TRUE
is.factor(ks$currency)
## [1] TRUE
is.factor(ks$location)
## [1] TRUE
is.factor(ks$creator)
## [1] TRUE
# Or
class(ks$state)
## [1] "factor"
class(ks$category)
## [1] "factor"
class(ks$is_starrable)
## [1] "factor"
class(ks$spotlight)
## [1] "factor"
class(ks$currency)
## [1] "factor"
class(ks$location)
## [1] "factor"
class(ks$creator)
## [1] "factor"
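Checking columns one at a time gets tedious as the list grows. Here is a small sketch of checking every column at once with sapply(); the `toy` data frame is hypothetical and only stands in for `ks`.

```r
# Hypothetical toy data; only `state` has been converted to a factor.
toy <- data.frame(
  state = factor(c("failed", "successful", "failed")),
  name  = c("Project A", "Project B", "Project C"),
  goal  = c(5000, 1200, 300),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per column.
factor_check <- sapply(toy, is.factor)
factor_check

# Names of the columns that are factors.
names(toy)[factor_check]
```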
We just changed our working dataset again, so we need to update the codebook again. Let’s head over there now. Don’t forget about “state”, which we encoded in a separate, earlier chunk.
Exploring the “state” Variable
Because the state variable contains information about whether or not each project was successful, it seems like a logical place to start exploring.
levels(ks$state)
## [1] "canceled" "failed" "live" "successful" "suspended"
Let’s take a look at how many of our projects fall into each of the categories.
ks %>%
  ggplot(aes(x = state)) +
  geom_bar()

ks %>%
  group_by(state) %>%
  summarise(n())
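The group_by() plus summarise(n()) pattern is common enough that dplyr offers count() as a shortcut. Here is a minimal sketch on made-up data (the `toy` data frame is an assumption, not our Kickstarter sample):

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration.
toy <- data.frame(
  state = c("failed", "successful", "failed", "canceled", "failed"),
  stringsAsFactors = FALSE
)

# Same idea as above, but giving the count column an explicit name.
manual_counts <- toy %>%
  group_by(state) %>%
  summarise(n = n())

# count() wraps the group_by() + summarise(n()) pattern in one call;
# sort = TRUE puts the largest group first.
state_counts <- toy %>%
  count(state, sort = TRUE)

state_counts
```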
Looking at the chart it’s easy to see that most of our projects fall into either the “failed” or “successful” categories. Thinking through the meaning of these, I think the failed and successful projects are probably of the most interest to us. The projects that are still live likely don’t help us much, since they don’t yet have an outcome. A little research as to why Kickstarter would suspend a project reveals that there are a number of possible reasons, but most can be summed up as a violation of Kickstarter’s rules or terms. Let’s pull up the suspended project to see if we can tell what happened here.
ks %>%
  filter(state == "suspended")
So we have a project called Boltivate. What I find most interesting is that the amount pledged was far above the goal at the time the project was suspended. I found Boltivate’s profile on Kickstarter, which is still active at the time I’m putting this together. There is some interesting discussion in the comments section from the day they were suspended and shortly thereafter if you’re interested.
My initial thought had been that suspended projects could potentially be lumped in with the failed projects. With Boltivate having received pledges of over four times its goal, calling it failed doesn’t seem appropriate. I think we should ignore it for our analysis.
I also originally thought that cancelled projects could be included with failed projects. I’m basing this on the assumption that creators cancel a project when it becomes apparent that it will inevitably fail. Let’s test this assumption and see if it holds any water. One way we can do that is to compare the amount pledged to the goal.
ks %>%
  filter(state == "canceled") %>%
  mutate(pledged_to_goal = pledged / goal) %>%
  ggplot(aes(pledged_to_goal)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
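That message is ggplot2 nudging us to choose a bin width ourselves instead of relying on the default of 30 bins. Here is a rough sketch of how that could look; the ratios in `toy` are invented, and the bin width of 0.05 is just a guess at something reasonable for values between 0 and 1.

```r
library(ggplot2)

# Hypothetical pledged-to-goal ratios; not taken from `ks`.
toy <- data.frame(
  pledged_to_goal = c(0, 0, 0.01, 0.02, 0.1, 0.35, 0.6, 0.88)
)

# binwidth sets the width of each bar; boundary = 0 starts the first bin at 0.
p <- ggplot(toy, aes(x = pledged_to_goal)) +
  geom_histogram(binwidth = 0.05, boundary = 0)

p
```

With an explicit binwidth, the `stat_bin()` message goes away.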

Looks like most of the cancelled projects had a pledged-to-goal ratio at or very near 0, meaning few or no pledges. Some are higher, but none appear to have met their goal (a ratio of 1 or greater). Let’s look a little closer at the more successful of those projects.
ks %>%
  mutate(pledged_to_goal = pledged / goal) %>%
  filter(state == "canceled", pledged_to_goal >= .5)
One thing stands out from the entries above: the time between cancellation and the respective deadline varies from a few hours to a couple of weeks. In fact, the most successful project (88% funded) was cancelled well before its deadline. This seems to invalidate my assumption that projects were cancelled when it became inevitable that they would fail.
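If you would rather compute that gap than eyeball it, here is a minimal sketch using difftime(). It assumes that state_changed_at records the moment a project was cancelled, which I have not confirmed, and the timestamps in `toy` are invented.

```r
# Hypothetical timestamps (UTC) for two cancelled projects; not from `ks`.
toy <- data.frame(
  state_changed_at = as.POSIXct(
    c("2016-03-01 12:00:00", "2016-05-20 18:00:00"), tz = "UTC"
  ),
  deadline = as.POSIXct(
    c("2016-03-15 12:00:00", "2016-05-21 00:00:00"), tz = "UTC"
  )
)

# Days between the state change and the deadline.
toy$days_before_deadline <- as.numeric(
  difftime(toy$deadline, toy$state_changed_at, units = "days")
)

toy
```

Here the first toy project was cancelled 14 days before its deadline and the second only 6 hours (0.25 days) before.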
My personal takeaway is that we should ignore cancelled projects as well since we don’t have much (or possibly any) information about why they were cancelled and don’t want to make generalizations that would be inaccurate in at least some instances.
So let’s focus in on the successful and failed projects. We’ll use the filter function.
(ks <- ks %>%
  filter(state == "successful" | state == "failed"))
ks %>%
  ggplot(aes(x = state)) +
  geom_bar()

ks %>%
  group_by(state) %>%
  summarise(n())
So we now have a categorical variable with two possible options: “failed” or “successful”.
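One caveat worth knowing: filtering rows does not remove unused levels from a factor, so levels() may still list the categories we dropped. Here is a small sketch on a made-up vector; droplevels() is base R, and forcats::fct_drop() is an alternative.

```r
# Toy factor standing in for ks$state; values are invented.
state <- factor(c("failed", "successful", "canceled", "failed", "live"))

# Keep only the two outcomes of interest.
kept <- state[state %in% c("successful", "failed")]

# The unused levels are still attached to the factor.
levels(kept)

# droplevels() removes levels that no longer occur in the data.
kept <- droplevels(kept)
levels(kept)
```

After droplevels(), only "failed" and "successful" remain as levels.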
Based on the question we are trying to answer, we will be building a model in the next phase which attempts to predict or explain the success of Kickstarter projects. While actually building the model is the subject of Phase 3, during this phase we are essentially preparing ourselves and our data to build that model. One of our underlying goals is to be thinking of ways to answer our question. Because the “state” variable contains information about the success or failure of each project, it becomes an obvious candidate to be our dependent variable. Because it is a binary variable, meaning each case falls into one of two categories, we will have the opportunity to build a logistic regression model. Logistic regression models attempt to explain the relationship between a binary dependent variable and one or more independent variables.
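Just to give a flavor of what that looks like in R, here is a tiny sketch using glm() with family = binomial. This is not the model we will build in Phase 3: the data are invented, and using a backer count as the only predictor is purely an assumption for illustration.

```r
# Invented toy data: 1 = successful, 0 = failed.
toy <- data.frame(
  successful = c(0, 0, 0, 1, 0, 1, 1, 1),
  backers    = c(1, 3, 5, 8, 10, 12, 20, 30)
)

# Logistic regression: a binary outcome explained by one predictor.
fit <- glm(successful ~ backers, data = toy, family = binomial)

# Coefficients are on the log-odds scale.
coef(fit)

# Predicted probability of success for a hypothetical project with 15 backers.
p_success <- predict(
  fit,
  newdata = data.frame(backers = 15),
  type = "response"
)
p_success
```

The key point is that the model returns a probability between 0 and 1 instead of a dollar amount.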
When the time comes, however, I will want to build another type of model: one using linear regression. Linear regression uses a continuous, rather than categorical, dependent variable. In this particular situation, this may prove important because it will have the potential to explain degrees of success rather than simply success or failure. For example, if the binary view is taken, projects that barely miss their goal are classified as just as much of a failure as projects that get no funding at all. Similarly, projects that barely meet their goal are viewed the same as projects that exceed their goal tenfold. Each view may be useful in different situations, so we will make sure you are comfortable using both.
With the goal in mind of also being able to create a linear regression model, we need to begin thinking about the dependent variable to use. “state” is out because it is categorical. “pledged” is the next obvious candidate, as its value certainly demonstrates success. The problem with the “pledged” variable is that by itself it does not contain enough information to help us. For example, for some projects pledges of $1000 would be a massive success, but for others that same $1000 would mean massive failure. Therefore we have to consider the amount pledged relative to each project’s goal.
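To make that concrete, here is a quick bit of arithmetic with two hypothetical projects that each raised $1000 against very different goals (the numbers are invented):

```r
# Two hypothetical projects, each with $1000 pledged.
pledged <- c(1000, 1000)

# Their goals differ by a factor of 100.
goal <- c(500, 50000)

# Amount pledged relative to the goal.
ratio <- pledged / goal
ratio
```

The first project raised twice its goal (a ratio of 2), while the second reached only 2% of its goal (a ratio of 0.02), even though the amount pledged is identical.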
In order to do so, we will create a new variable that contains the ratio of the amount pledged to the goal for each project. We have actually already used a variable just like this when we were exploring the cancelled projects before we decided to remove them. We just didn’t make it permanent at the time.
To refresh your memory, here is how to create the temporary variable:
ks %>%
  mutate(pledged_to_goal = pledged / goal)